attention parameter
- North America > United States (0.14)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- North America > United States > California > Santa Clara County > Mountain View (0.04)
- North America > United States > New York > Richmond County > New York City (0.04)
- North America > United States > New York > Queens County > New York City (0.04)
- North America > United States > Wisconsin (0.04)
- North America > United States > Texas (0.04)
- Europe > Germany > Saarland > Saarbrücken (0.04)
- Europe > Germany > North Rhine-Westphalia > Cologne Region > Cologne (0.04)
- North America > Canada > Quebec > Montreal (0.04)
BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
Zhang, Qizhen, Gritsch, Nikolas, Gnaneshwar, Dwaraknath, Guo, Simon, Cairuz, David, Venkitesh, Bharat, Foerster, Jakob, Blunsom, Phil, Ruder, Sebastian, Ustun, Ahmet, Locatelli, Acyr
The Mixture of Experts (MoE) framework has become a popular architecture for large language models due to its superior performance over dense models. However, training MoEs from scratch in a large-scale regime is prohibitively expensive. Existing methods mitigate this by pre-training multiple dense expert models independently and using them to initialize an MoE. This is done by using the experts' feed-forward networks (FFNs) to initialize the MoE's experts while merging other parameters. However, this method limits the reuse of dense model parameters to only the FFN layers, thereby constraining the advantages when "upcycling" these models into MoEs. We propose BAM (Branch-Attend-Mix), a simple yet effective method that addresses this shortcoming. BAM makes full use of specialized dense models by not only using their FFNs to initialize the MoE layers but also fully leveraging the experts' attention parameters by initializing them into a soft variant of Mixture of Attention (MoA) layers. We explore two methods for upcycling attention parameters: 1) initializing separate attention experts from dense models, including all attention parameters, for the best model performance; and 2) sharing key and value parameters across all experts for better inference efficiency. To further improve efficiency, we adopt a parallel attention transformer architecture for MoEs, which allows the attention experts and FFN experts to be computed concurrently. Our experiments on seed models ranging from 590 million to 2 billion parameters demonstrate that BAM surpasses baselines in both perplexity and downstream task performance, within the same computational and data constraints.
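The difference between the FFN-only upcycling described in the abstract and reusing attention parameters as well can be illustrated with a toy sketch. It is not the authors' implementation: the function name `upcycle`, the dictionary layout (`ffn`, `attn`, `other`), and the use of a plain element-wise mean as the "merging" step are all assumptions made for illustration, and routing is omitted entirely.

```python
def upcycle(dense_models, keep_attention):
    """Toy initialization of an MoE from dense models (hypothetical layout).

    dense_models: list of dicts with keys 'ffn', 'attn', 'other',
                  each mapping to a flat list of floats.
    keep_attention: if False, attention parameters are merged like all
                    other non-FFN parameters; if True, each dense model's
                    attention parameters become a separate attention expert.
    """
    def mean(vectors):
        # Element-wise mean; averaging is an assumed merge rule.
        return [sum(vals) / len(vals) for vals in zip(*vectors)]

    moe = {
        # Each dense model's FFN initializes one FFN expert.
        "ffn_experts": [list(m["ffn"]) for m in dense_models],
        # Remaining parameters are merged into a single shared copy.
        "other": mean([m["other"] for m in dense_models]),
    }
    if keep_attention:
        moe["attn_experts"] = [list(m["attn"]) for m in dense_models]
    else:
        moe["attn"] = mean([m["attn"] for m in dense_models])
    return moe


models = [
    {"ffn": [1.0], "attn": [1.0], "other": [0.0]},
    {"ffn": [2.0], "attn": [3.0], "other": [2.0]},
]
ffn_only = upcycle(models, keep_attention=False)
with_attention = upcycle(models, keep_attention=True)
```

In the FFN-only case the two attention vectors collapse into one averaged copy, whereas with `keep_attention=True` both survive as separate experts, which is the kind of parameter reuse the abstract argues for.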
- North America > United States (0.14)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
GATE: How to Keep Out Intrusive Neighbors
Mustafa, Nimrah, Burkholz, Rebekka
Graph Attention Networks (GATs) are designed to provide flexible neighborhood aggregation that assigns weights to neighbors according to their importance. In practice, however, GATs are often unable to switch off task-irrelevant neighborhood aggregation, as we show experimentally and analytically. To address this challenge, we propose GATE, a GAT extension that holds three major advantages: i) It alleviates over-smoothing by addressing its root cause of unnecessary neighborhood aggregation. ii) Similarly to perceptrons, it benefits from higher depth as it can still utilize additional layers for (non-)linear feature transformations in case of (nearly) switched-off neighborhood aggregation. iii) By down-weighting connections to unrelated neighbors, it often outperforms GATs on real-world heterophilic datasets. To further validate our claims, we construct a synthetic test bed to analyze a model's ability to utilize the appropriate amount of neighborhood aggregation, which could be of independent interest.
- Europe > Austria > Vienna (0.14)
- North America > United States > Texas (0.05)
- North America > United States > Wisconsin (0.04)
- Europe > Germany > Saarland > Saarbrücken (0.04)
Generalizable and Stable Finetuning of Pretrained Language Models on Low-Resource Texts
Somayajula, Sai Ashish, Liang, Youwei, Singh, Abhishek, Zhang, Li, Xie, Pengtao
Pretrained Language Models (PLMs) have advanced Natural Language Processing (NLP) tasks significantly, but finetuning PLMs on low-resource datasets poses significant challenges such as instability and overfitting. Previous methods tackle these issues by finetuning a strategically chosen sub-network on a downstream task, while keeping the remaining weights fixed to the pretrained weights. However, they rely on suboptimal criteria for sub-network selection, leading to suboptimal solutions. To address these limitations, we propose a regularization method based on attention-guided weight mixup for finetuning PLMs. Our approach represents each network weight as a mixup of a task-specific weight and a pretrained weight, controlled by a learnable attention parameter, providing finer control over sub-network selection. Furthermore, we employ a bi-level optimization (BLO) based framework on two separate splits of the training dataset, improving generalization and combating overfitting. We validate the efficacy of our proposed method through extensive experiments, demonstrating its superiority over previous methods, particularly in the context of finetuning PLMs on low-resource datasets.
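The weight mixup described in the abstract can be read as a per-weight convex combination of a task-specific weight and a pretrained weight. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the name `mixup_weights` is hypothetical, and squashing an unconstrained logit through a sigmoid to obtain the attention parameter is an assumed parameterization. The bi-level optimization that would learn these parameters is not shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mixup_weights(task_weights, pretrained_weights, attention_logits):
    """Mix each weight as alpha * w_task + (1 - alpha) * w_pretrained.

    alpha = sigmoid(logit) is an assumed way to keep the attention
    parameter in (0, 1); the abstract only says it is learnable.
    """
    mixed = []
    for w_task, w_pre, logit in zip(task_weights, pretrained_weights, attention_logits):
        alpha = sigmoid(logit)
        mixed.append(alpha * w_task + (1.0 - alpha) * w_pre)
    return mixed


# A logit of 0 gives an even mix; a large logit selects the task weight.
even_mix = mixup_weights([1.0], [0.0], [0.0])
mostly_task = mixup_weights([1.0], [0.0], [50.0])
```

With alpha near 1 a weight behaves as if it were in the finetuned sub-network, and with alpha near 0 it stays at its pretrained value, so a continuous alpha gives a soft version of sub-network selection.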
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > United Kingdom > England > Hampshire > Southampton (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.67)
Merging Decision Transformers: Weight Averaging for Forming Multi-Task Policies
Lawson, Daniel, Qureshi, Ahmed H.
Recent work has shown the promise of creating generalist, transformer-based models for language, vision, and sequential decision-making problems. To create such models, we generally require centralized training objectives, data, and compute. It is of interest whether we can more flexibly create generalist policies by merging multiple task-specific, individually trained policies. In this work, we take a preliminary step in this direction by merging, or averaging, subsets of Decision Transformers in parameter space trained on different MuJoCo locomotion problems, forming multi-task models without centralized training. We also demonstrate the importance of various methodological choices when merging policies, such as utilizing common pre-trained initializations, increasing model capacity, and utilizing Fisher information for weighting parameter importance. In general, we believe research in this direction could help democratize and distribute the process that forms multi-task robotics policies. Our implementation is available at https://github.com/daniellawson9999/merging-decision-transformers.
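Merging in parameter space, with and without Fisher weighting, can be sketched on flat parameter lists. This is an illustrative sketch rather than the code in the linked repository: the function names are hypothetical, the Fisher values are assumed to be precomputed non-negative per-parameter importance estimates, and the small `eps` term is an assumption added to avoid division by zero.

```python
def average_merge(param_sets):
    """Uniform average of several models' parameters (flat lists of floats)."""
    n = len(param_sets)
    return [sum(vals) / n for vals in zip(*param_sets)]


def fisher_merge(param_sets, fisher_sets, eps=1e-8):
    """Per-parameter weighted average, weighting each model by its Fisher value.

    merged_i = sum_k F_k[i] * theta_k[i] / (sum_k F_k[i] + eps)
    """
    merged = []
    for vals, fishers in zip(zip(*param_sets), zip(*fisher_sets)):
        total = sum(fishers) + eps
        merged.append(sum(f * v for f, v in zip(fishers, vals)) / total)
    return merged


uniform = average_merge([[1.0, 2.0], [3.0, 4.0]])
# The second model has three times the Fisher weight for this parameter,
# so the merged value is pulled toward its parameter.
weighted = fisher_merge([[0.0], [4.0]], [[1.0], [3.0]])
```

Uniform averaging treats every model's parameter as equally important, while the Fisher-weighted variant lets the model for which a parameter matters more dominate that entry of the merged policy.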
- North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
- North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)
- Asia > Middle East > Jordan (0.04)